Asymptotic Distribution of Certain Types of Entropy under the Multinomial Law

Simple Summary: We obtain expressions for the asymptotic distributions of the Rényi and Tsallis entropies of order q, and of the Fisher information, when computed on the maximum likelihood estimator of probabilities from multinomial random samples. We recall results related to the Shannon entropy. We build a test for comparing entropies of different types and with different numbers of categories.

Abstract: We obtain expressions for the asymptotic distributions of the Rényi and Tsallis entropies of order q, and of the Fisher information, when computed on the maximum likelihood estimator of probabilities from multinomial random samples. We verify that these asymptotic models, two of which (Tsallis and Fisher) are normal, describe a variety of simulated data well. In addition, we obtain test statistics for comparing (possibly different types of) entropies from two samples without requiring the same number of categories. Finally, we apply these tests to social survey data and verify that the results are consistent with, but more general than, those obtained with a χ² test.


Introduction
The multinomial distribution is an adequate model for describing how observations fall into categories. Quoting Johnson et al. [1], "The Multinomial distribution, like the Multivariate Normal distribution among the continuous multivariate distributions, consumed a sizable amount of the attention that numerous theoretical as well as applied researchers directed towards the area of discrete multivariate distributions." The entropy of a (multivariate, in our case) random variable is a substantial quantity. It quantifies the predictability of a system whose outputs can be described by such a model. Entropy has several definitions, both conceptual and mathematical. The concept of entropy originated as a way to relate a system's energy and temperature [2]. The same concept was used to describe the number of ways the particles of a system can be arranged.
Entropy has seldom been studied as a random variable. Hutcheson [3] and Hutcheson and Shenton [4] discussed the exact expected value and variance of the Shannon entropy under the multinomial model. These works also provided approximate expressions that circumvent the numerical issues that arise when using the exact values. Jacquet and Szpankowski [5] studied high-quality analytic approximations of the Rényi entropy, of which the Shannon entropy is a particular case, under the binomial model. With the same approach, Cichoń and Golębiewski [6] obtained expressions for more general functionals that include the multinomial distribution. These works treat the entropy as a fixed quantity. Cook et al. [7] studied almost unbiased estimators of functions of the parameter of the binomial distribution. The authors extended those results to find an almost unbiased estimator for the entropy under multinomial laws.
Chagas et al. [8] treated the Shannon entropy as a random variable. The authors obtained its asymptotic distribution when indexing by the maximum likelihood estimators of the proportions under the multinomial distribution. This result allowed the devising of unilateral and bilateral tests for comparing the entropy from two samples in a very general way. These tests do not require having the same number of categories.
In this work, we study the asymptotic distribution of other forms of entropy under the multinomial model. These distributions allow comparing large samples through their entropies; since only the entropies are compared, the samples may have different numbers of classes, and the entropies may be of different types. We first apply the multivariate delta method and, in the case of the Rényi entropy, we transform the resulting multivariate normal distribution into that of the logarithm of the absolute value of a normally distributed random variable. Then, we provide the general expression of a test statistic that suits our needs.

This paper unfolds as follows. Section 2 recalls the main properties of the multinomial distribution and defines the four types of entropy we study. In Section 3, we present the central results, i.e., the asymptotic distributions of those entropies. We describe the techniques we used and leave the technical details for Appendix A.1. We validate our results with simulation studies in Section 4: we show the adequacy of the normal distribution as limit law for the entropies under three probability models of different support, considering various sample sizes. In Section 5, we show that these asymptotic properties lead to a helpful hypothesis test between samples with different numbers of categories. We conclude the article in Section 6. Appendix A.2 comments on applications that justify our choices of the number of categories and sample sizes in the simulation studies. Appendix A.3 discloses relevant computational information, including reproducibility.

Entropies and the Multinomial Distribution
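Consider a series of n independent trials in which exactly one of k mutually exclusive events $\pi_1, \pi_2, \dots, \pi_k$ is observed in each trial, with probabilities $\boldsymbol{p} = (p_1, p_2, \dots, p_k)$ such that $p_\ell \geq 0$ and $\sum_{\ell=1}^{k} p_\ell = 1$. Let $\boldsymbol{N} = (N_1, N_2, \dots, N_k)$ be the random vector that counts the number of occurrences of the events $\pi_1, \pi_2, \dots, \pi_k$ in the n trials, with $N_\ell \geq 0$ and $\sum_{\ell=1}^{k} N_\ell = n$. A sample from $\boldsymbol{N}$, say $\boldsymbol{n}$, is a k-variate vector of integer values $\boldsymbol{n} = (n_1, n_2, \dots, n_k)$. Then, the joint distribution of $\boldsymbol{N}$ is

$$\Pr(\boldsymbol{N} = \boldsymbol{n}) = \Pr(N_1 = n_1, N_2 = n_2, \dots, N_k = n_k) = n! \prod_{\ell=1}^{k} \frac{p_\ell^{n_\ell}}{n_\ell!}. \tag{1}$$

We denote this situation as $\boldsymbol{N} \sim \text{Mult}(n, \boldsymbol{p})$.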
In practice, one does not know the true values of $\boldsymbol{p}$, the probabilities that index this multinomial distribution. Such values are estimated by computing $\hat{p}_\ell$, the proportion of times the class (category, event) $\pi_\ell$ was observed among the k possible categories $\boldsymbol{\pi} = \{\pi_1, \pi_2, \dots, \pi_k\}$ during the n trials. The maximum likelihood estimator of $\boldsymbol{p}$, denoted $\hat{\boldsymbol{p}} = (\hat{p}_1, \hat{p}_2, \dots, \hat{p}_k)$, is the random vector of proportions. This maximum likelihood estimator coincides with the intuitive estimator based on the distribution's first moments, and it is the most frequently used in applications.
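The estimation step can be illustrated with a minimal Python sketch that uses only the standard library. The probabilities, the sample size, and the seed below are hypothetical choices made for illustration; the function names are ours and are not part of the original work.

```python
import random
from collections import Counter

def sample_multinomial(n, p, rng):
    """Draw one vector of counts N ~ Mult(n, p) from n categorical trials."""
    k = len(p)
    draws = rng.choices(range(k), weights=p, k=n)
    counts = Counter(draws)
    return [counts.get(i, 0) for i in range(k)]

def mle_proportions(counts):
    """Maximum likelihood estimate of p: the observed proportions N / n."""
    n = sum(counts)
    return [c / n for c in counts]

rng = random.Random(1234)      # arbitrary seed, only for reproducibility
p_true = [0.5, 0.3, 0.2]       # hypothetical probabilities, k = 3
counts = sample_multinomial(10_000, p_true, rng)
p_hat = mle_proportions(counts)
print(counts, p_hat)
```

With n much larger than k, as assumed throughout, the estimated proportions are expected to lie close to the true probabilities.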
We study the distribution of several forms of entropy of the random vector $\hat{\boldsymbol{p}}$ for fixed k. Notice that $\hat{\boldsymbol{p}}$ is computed over a single k-variate measurement of random proportions corresponding to a single random sample from $\boldsymbol{N} \sim \text{Mult}(n, \boldsymbol{p})$. The asymptotic behaviors we derive hold for typical cases in which $n \gg k$.

The Shannon entropy measures the disorder or unpredictability of systems characterized by a probability distribution. On the one hand, the minimum Shannon entropy occurs when there is complete knowledge about the system behavior and total confidence in predicting the following observation. On the other hand, when a uniform distribution describes the system's behavior, that is, when all possibilities have the same probability of occurrence, the knowledge about the behavior of the data is minimal. In Chagas et al. [8], we studied the asymptotic distribution of the Shannon entropy. In this work, we extend those results to three other forms of entropy.
Other types of descriptors have been proposed in the literature to extract additional information not captured by the Shannon entropy. Tsallis [9] and Rényi [10], for instance, proposed parametric versions, which include the Shannon entropy.
Fisher information [11] is defined by an average logarithm derivative of a continuous probability density function. In the case of discrete densities, this measure can be approximated using differences of probabilities between consecutive distribution elements. While the Shannon entropy captures the degree of unpredictability of a system, the Fisher information is related to the rate of change of consecutive observations and, thus, quantifies small changes and perturbations.
Given a type of entropy H, we are interested in the distribution of $H(\boldsymbol{p})$ when indexed by $\hat{\boldsymbol{p}}$, the maximum likelihood estimator of $\boldsymbol{p}$. Our problem then becomes finding the distribution of $H(\hat{\boldsymbol{p}})$ for the following:

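• The Shannon entropy:

$$H_S(\boldsymbol{p}) = -\sum_{\ell=1}^{k} p_\ell \log p_\ell. \tag{2}$$

• The Tsallis entropy with index $q \in \mathbb{R} \setminus \{1\}$:

$$H_T^q(\boldsymbol{p}) = \frac{1}{q-1}\Bigl(1 - \sum_{\ell=1}^{k} p_\ell^q\Bigr). \tag{3}$$

• The Rényi entropy of order $q \in \mathbb{R}_+ \setminus \{1\}$:

$$H_R^q(\boldsymbol{p}) = \frac{1}{1-q} \log \sum_{\ell=1}^{k} p_\ell^q. \tag{4}$$

• The Fisher information, also termed "Fisher Information Measure" in the literature, with renormalization coefficient $F_0$:

$$H_F(\boldsymbol{p}) = F_0 \sum_{\ell=1}^{k-1} \bigl(\sqrt{p_{\ell+1}} - \sqrt{p_\ell}\bigr)^2. \tag{5}$$

Among other possibilities, we used Equation (2.7) from Ref. [12].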
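The four descriptors can be sketched in Python as follows. This is an illustration under stated assumptions: it uses the usual textbook forms of the Shannon, Tsallis, and Rényi entropies, a discrete Fisher information built from differences of square roots of consecutive probabilities, and the convention 0 log 0 = 0. The renormalization coefficient `f0` is left as a parameter, and its default value of 1 is a placeholder of ours rather than the value used in the original work.

```python
import math

def shannon(p):
    """Shannon entropy, with the convention 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def tsallis(p, q):
    """Tsallis entropy of index q != 1."""
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)

def renyi(p, q):
    """Rényi entropy of order q > 0, q != 1."""
    return math.log(sum(x ** q for x in p if x > 0)) / (1.0 - q)

def fisher(p, f0=1.0):
    """Discrete Fisher information over consecutive probabilities.

    f0 is the renormalization coefficient; the default is a placeholder.
    """
    return f0 * sum((math.sqrt(b) - math.sqrt(a)) ** 2 for a, b in zip(p, p[1:]))

p = [0.5, 0.3, 0.2]   # hypothetical probabilities
print(shannon(p), tsallis(p, 1.5), renyi(p, 0.5), fisher(p))
```

Note that the Fisher information, unlike the other three quantities, depends on the order in which the categories are listed.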

Asymptotic Distributions of Entropies
The main results of this section are the asymptotic distributions of the Shannon (2), Tsallis of order q (3), and Rényi of order q (4) entropies, and of the Fisher information (5). These results are presented in Equations (30) (Shannon), (31) (Tsallis), (32) (Fisher information), and (35) (Rényi). Notice that the Rényi entropy is not asymptotically normally distributed, while the other three are.
We recall the following theorems, known respectively as the delta method and its multivariate version. We refer to Lehmann and Casella [13] for their proofs.

Theorem 1. Let $X_n$ be a sequence of independent and identically distributed random variables such that $\sqrt{n}[X_n - \theta]$ converges in distribution to a $N(0, \sigma^2)$ law. If $\partial h / \partial \theta$ exists and does not vanish, then $\sqrt{n}[h(X_n) - h(\theta)]$ converges in distribution to a $N(0, \sigma^2 [\partial h / \partial \theta]^2)$ law.

Theorem 2. Let $\boldsymbol{X}_n = (X_{1n}, X_{2n}, \dots, X_{kn})$ be a sequence of independent and identically distributed vectors of random variables such that $\sqrt{n}[\boldsymbol{X}_n - \boldsymbol{\theta}]$ converges in distribution to a multivariate normal distribution $N_k(\boldsymbol{0}, \Sigma)$, where $\Sigma$ is the covariance matrix. Suppose that $h_1, h_2, \dots, h_k$ are real functions continuously differentiable in a neighborhood of the parameter point $\boldsymbol{\theta} = (\theta_1, \theta_2, \dots, \theta_k)$ and such that the matrix of partial derivatives $B = (\partial h_\ell / \partial \theta_j)_{\ell, j = 1}^{k}$ is non-singular in the mentioned neighborhood. Then, the following convergence in distribution holds:

$$\sqrt{n}\,[\boldsymbol{h}(\boldsymbol{X}_n) - \boldsymbol{h}(\boldsymbol{\theta})] \xrightarrow{\;d\;} N_k(\boldsymbol{0}, B \Sigma B^{\top}),$$

where $\boldsymbol{h} = (h_1, h_2, \dots, h_k)$ and $B^{\top}$ denotes the transpose of $B$.
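Now, we focus on the case $\boldsymbol{N} \sim \text{Mult}(n, \boldsymbol{p})$. Let $\hat{\boldsymbol{p}} = \boldsymbol{N}/n$ be the vector of sample proportions, which coincides with the maximum likelihood estimator (MLE) of $\boldsymbol{p}$. Let us explore the covariance matrix in this case. Since $\operatorname{Var}(N_\ell) = n p_\ell (1 - p_\ell)$ and $\operatorname{Cov}(N_\ell, N_j) = -n p_\ell p_j$ for $\ell \neq j$, we have

$$\operatorname{Var}(\sqrt{n}\,\hat{p}_\ell) = p_\ell (1 - p_\ell), \qquad \operatorname{Cov}(\sqrt{n}\,\hat{p}_\ell, \sqrt{n}\,\hat{p}_j) = -p_\ell p_j, \quad \ell \neq j.$$

It means that the covariance matrix $\Sigma_{\boldsymbol{p}} \in \mathbb{R}^{k \times k}$ we are interested in is of the form

$$\Sigma_{\boldsymbol{p}} = \begin{pmatrix} p_1(1 - p_1) & -p_1 p_2 & \cdots & -p_1 p_k \\ -p_2 p_1 & p_2(1 - p_2) & \cdots & -p_2 p_k \\ \vdots & \vdots & \ddots & \vdots \\ -p_k p_1 & -p_k p_2 & \cdots & p_k(1 - p_k) \end{pmatrix}.$$

The above statements are generalized. In the following, we obtain new results for the Tsallis and Rényi entropies, and for the Fisher information. For the sake of completeness, we also include the results for the Shannon entropy.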
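The structure of this covariance matrix can be checked numerically. The sketch below builds the matrix with entries $p_\ell(1 - p_\ell)$ on the diagonal and $-p_\ell p_j$ elsewhere, which is the standard covariance of $\sqrt{n}\,\hat{\boldsymbol{p}}$ under the multinomial law, and compares it with a Monte Carlo estimate. The probabilities, sample size, number of replicates, and seed are hypothetical.

```python
import random

def proportion_covariance(p):
    """Covariance matrix of sqrt(n) * p_hat: diag(p) - p p^T."""
    k = len(p)
    return [[p[i] * (1 - p[i]) if i == j else -p[i] * p[j] for j in range(k)]
            for i in range(k)]

def empirical_covariance(p, n, replicates, rng):
    """Monte Carlo estimate of the covariance of sqrt(n) * (p_hat - p)."""
    k = len(p)
    acc = [[0.0] * k for _ in range(k)]
    for _ in range(replicates):
        draws = rng.choices(range(k), weights=p, k=n)
        dev = [(draws.count(i) / n - p[i]) * n ** 0.5 for i in range(k)]
        for i in range(k):
            for j in range(k):
                acc[i][j] += dev[i] * dev[j] / replicates
    return acc

p = [0.5, 0.3, 0.2]   # hypothetical probabilities
sigma = proportion_covariance(p)
sigma_mc = empirical_covariance(p, n=500, replicates=2000, rng=random.Random(7))
```

Each row of the matrix sums to zero because the proportions are constrained to add up to one; the matrix is therefore singular.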
Using the limit distribution presented in (29) with $\boldsymbol{a} = (-1, -1, \dots, -1)$, we directly obtain the asymptotic distribution of the Shannon entropy, Equation (30). With similar arguments and $\boldsymbol{a} = (1, 1, \dots, 1)$, we obtain the asymptotic distribution of the Tsallis entropy of order q, Equation (31). The procedure is analogous for the Fisher information but with $\boldsymbol{a} = (1, 1, \dots, 1) \in \mathbb{R}^{k-1}$, which leads to Equation (32) and the accompanying expression (33). To obtain expression (33), we use the symmetry of the covariance matrix. It is worth noticing that the expression of the covariance matrix for the Fisher information is more complicated than those of the previously analyzed entropies, since the matrix of partial derivatives is not diagonal in this case.
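As an illustration of how the delta method produces these limits, the following sketch computes first-order asymptotic variances for the Shannon and Tsallis entropies. It is derived here from the usual definitions of both entropies and from the multinomial covariance of the proportions; it is not a transcription of the numbered equations, whose notation may differ. For a gradient $\boldsymbol{g}$ evaluated at $\boldsymbol{p}$, the quadratic form reduces to $\sum_\ell p_\ell g_\ell^2 - (\sum_\ell p_\ell g_\ell)^2$. All probabilities are assumed to be strictly positive.

```python
import math

def quadratic_form_variance(p, grad):
    """g^T (diag(p) - p p^T) g: variance of the limit of sqrt(n) * (H(p_hat) - H(p))."""
    mean_g = sum(pi * gi for pi, gi in zip(p, grad))
    return sum(pi * gi * gi for pi, gi in zip(p, grad)) - mean_g ** 2

def shannon_asymptotic_variance(p, n):
    """First-order variance of the Shannon entropy of p_hat (all p > 0)."""
    grad = [-(math.log(pi) + 1.0) for pi in p]
    return quadratic_form_variance(p, grad) / n

def tsallis_asymptotic_variance(p, q, n):
    """First-order variance of the Tsallis entropy of index q of p_hat (all p > 0)."""
    grad = [-q * pi ** (q - 1.0) / (q - 1.0) for pi in p]
    return quadratic_form_variance(p, grad) / n

p = [0.5, 0.3, 0.2]   # hypothetical probabilities
print(shannon_asymptotic_variance(p, 1000), tsallis_asymptotic_variance(p, 2.0, 1000))
```

Under the uniform distribution, both gradients are constant and the first-order variance vanishes, so this approximation is degenerate in that case.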
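The case of the Rényi entropy is different. Following the previous methodology, we can prove that the sum $\sum_{\ell=1}^{k} \hat{p}_\ell^{\,q}$ is asymptotically normally distributed. Hence, the Rényi entropy, which is proportional to the logarithm of this sum, has the asymptotic distribution given in (35). Notice that this is not a normal distribution but the distribution of the logarithm of the absolute value of a normally distributed random variable.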
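The two-step construction for the Rényi entropy can be sketched as follows. The mean and variance of the normal approximation to $\sum_\ell \hat{p}_\ell^{\,q}$ are derived here with the delta method under the usual definition of the Rényi entropy, and they are an assumption of this sketch rather than a transcription of Equation (35). Values from the limit law are then obtained by taking the logarithm of the absolute value of normal draws and dividing by $1 - q$.

```python
import math
import random

def power_sum_normal_params(p, q, n):
    """Mean and variance of the normal approximation to sum(p_hat ** q)."""
    mu = sum(pi ** q for pi in p)
    var = q * q * (sum(pi ** (2 * q - 1) for pi in p) - mu * mu) / n
    return mu, max(var, 0.0)   # guard against tiny negative rounding errors

def sample_renyi_limit(p, q, n, size, rng):
    """Draw values of log|Z| / (1 - q), with Z normal as described above."""
    mu, var = power_sum_normal_params(p, q, n)
    sd = math.sqrt(var)
    return [math.log(abs(rng.gauss(mu, sd))) / (1.0 - q) for _ in range(size)]

p = [0.5, 0.3, 0.2]   # hypothetical probabilities
draws = sample_renyi_limit(p, q=2.0, n=1000, size=2000, rng=random.Random(1))
```

When the standard deviation of the normal variable is small relative to its mean, the logarithm is nearly linear over the relevant range, which is one way to understand why the resulting density looks close to a Gaussian.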
Often, in practice, these entropies are scaled to lie in [0, 1]; these are called "normalized entropies". The following modifications must be considered in the normalized versions of the entropies. For the normalized Shannon entropy, the asymptotic mean and variance are multiplied by $1/\log k$ and $1/(\log k)^2$, respectively. In the case of the normalized Tsallis entropy, the asymptotic mean and variance are multiplied by $(q - 1)/(1 - k^{1-q})$ and $(q - 1)^2/(1 - k^{1-q})^2$, respectively. Finally, the asymptotic density of the normalized Rényi entropy is $\log k \, P_R^q(x \log k)$, where $P_R^q$ denotes the asymptotic density of the Rényi entropy. Notice that normalized entropies do not depend on the base of the logarithm. The Fisher information is, as defined in (5), already normalized.
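The scaling factors above translate directly into code. The sketch below assumes the usual definitions of the Shannon, Tsallis, and Rényi entropies with the convention 0 log 0 = 0; the function names are ours.

```python
import math

def normalized_shannon(p):
    """Shannon entropy divided by log k."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))

def normalized_tsallis(p, q):
    """Tsallis entropy multiplied by (q - 1) / (1 - k^(1 - q))."""
    k = len(p)
    h = (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)
    return h * (q - 1.0) / (1.0 - k ** (1.0 - q))

def normalized_renyi(p, q):
    """Rényi entropy divided by log k."""
    h = math.log(sum(x ** q for x in p if x > 0)) / (1.0 - q)
    return h / math.log(len(p))

def normalized_shannon_variance(variance, k):
    """Rescale an asymptotic variance of the Shannon entropy by 1 / (log k)^2."""
    return variance / math.log(k) ** 2

p = [0.4, 0.3, 0.2, 0.1]   # hypothetical probabilities
print(normalized_shannon(p), normalized_tsallis(p, 1.5), normalized_renyi(p, 0.5))
```

Each normalized entropy equals one under the uniform distribution and zero when a single category has probability one.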

Analysis and Validation
In this section, we study the empirical distribution of the entropies computed from $\hat{\boldsymbol{p}}$ under three models, four numbers of categories ($k \in \{6, 24, 120, 720\}$), and three sample sizes ($n \in \{10^2 k, 10^3 k, 10^4 k\}$) that depend on the number of categories. These choices of k and n are based on the values that appear in signal analysis with ordinal patterns; see details of this technique in Appendix A.2.
These probability functions are illustrated, for k = 6 and the model parameter set to 0.3, in Figure 1. We studied the behavior of the Shannon entropy, the Rényi entropy with q ∈ {1/3, 2/5}, the Tsallis entropy with q ∈ {1/2, 3/2}, and the Fisher information computed on samples of sizes $n \in \{10^2 k, 10^3 k, 10^4 k\}$. We used 300 independent samples (replicates).

Although Equation (35) shows that the Rényi entropy is not asymptotically normal, we verified that its density is similar to that of a Gaussian distribution. With this in mind, we also checked the normality of the Rényi entropies. We used the Anderson-Darling test to verify the null hypothesis that the data follow a normal distribution. We chose this test because it uses the hypothesized distribution in calculating critical values. This test is more sensitive than other alternatives; see, for instance, the book by Lehmann and Romano [14].
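A normality check of this kind can be sketched with the standard library alone. The function below computes the Anderson-Darling statistic for normality with estimated mean and variance, using the small-sample adjustment $A^2 (1 + 0.75/n + 2.25/n^2)$. The 1% critical value of about 1.035 is taken from commonly tabulated values for this adjusted statistic and is an assumption of this sketch, as are the probabilities, the seed, and the use of the Shannon entropy as the example; this is not the code used to produce the reported results.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(sample):
    """Adjusted A^2 statistic for normality with estimated mean and variance."""
    x = sorted(sample)
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))
    # Clip the fitted CDF away from 0 and 1 so the logarithms stay finite.
    u = [min(max(fitted.cdf(v), 1e-15), 1.0 - 1e-15) for v in x]
    s = sum((2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
            for i in range(n))
    a2 = -n - s / n
    return a2 * (1.0 + 0.75 / n + 2.25 / n ** 2)

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def simulated_shannon(p, n, replicates, rng):
    """Shannon entropies of the MLE over independent multinomial samples."""
    k = len(p)
    values = []
    for _ in range(replicates):
        draws = rng.choices(range(k), weights=p, k=n)
        values.append(shannon([draws.count(i) / n for i in range(k)]))
    return values

p = [0.30, 0.25, 0.20, 0.15, 0.07, 0.03]   # hypothetical probabilities, k = 6
values = simulated_shannon(p, n=600, replicates=300, rng=random.Random(2023))
a2 = anderson_darling_normal(values)
reject_at_1_percent = a2 > 1.035           # assumed tabulated critical value
```

Larger values of the statistic indicate stronger departures from normality, with extra weight given to the tails of the distribution.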
From Table 1, we notice that the Fisher information is the one that most often fails to pass the normality test at the 1% level. The situation that appears with p-value = 0.0010 in the table has, in fact, p-value = 9.606130 × 10⁻³; the table shows rounded values. Figure 2 shows four of these cases, namely for k = 6, n = 600, and model parameter values 0.1, 0.3, 0.5, and 0.8. We notice that the deviation from the normal hypothesis is more prevalent in both tails, where the observations are larger than the theoretical quantiles.

Table 1. Situations for which the p-values of the Anderson-Darling test for the normality of samples of size 300 are less than 0.01 ("HF" stands for the Fisher information; "HaH" and "OAZ" are the Half-And-Half and One-Almost-Zero models).

The normality hypothesis was rejected at the 1% level by the Anderson-Darling test in only 24 out of 432 situations, showing that the asymptotic Gaussian model for the entropies is a good description for these data. Table 1 shows those situations. To assess the goodness of fit of the asymptotic models, we applied the Kolmogorov-Smirnov test to fifty replicates of samples. Table 2 shows the results where the p-value of the test is at least equal to 0.05.
It is worth noticing that even in those cases where the p-value is less than 0.05, the asymptotic models are a good fit to the data, as can be seen in the examples exhibited in Figure 3. The Fisher information shows the worst fit. Additionally, notice in Figure 3d that, although the asymptotic distribution of the Rényi entropy is not normal, the probability density function is visually very close to the Gaussian model. We verified this similarity in all the cases we considered.

Table 2. Situations for which the p-values of the Kolmogorov-Smirnov test of samples of size 50 are larger than or equal to 0.05 ("HaH" and "OAZ" are the Half-And-Half and One-Almost-Zero models).

Application
Inspired by an example from Agresti [16] (p. 200), we extracted data from the General Social Survey (GSS), a project of the independent research organization NORC at the University of Chicago, with principal funding from the National Science Foundation, available at https://gss.norc.org/. The data were downloaded on 24 December 2022. Table 3 shows the level of agreement with the assertion "Religious people are often too intolerant" as measured in three years. We computed the p-values of pairwise χ² tests for the null hypotheses that the underlying probabilities are equal. On the one hand, these values attest that the pairs 1998-2008 and 1998-2018 are very different. On the other hand, the change between 2008 and 2018, although significant, is less pronounced.

Table 4 shows the asymptotic mean and variance (in normalized entropy units) of the entropies of the proportions reported in Table 3. We perform the same hypothesis test with the asymptotic quantities presented in Table 4. Table 5 shows the p-values of the null hypothesis that the entropies are equal, using the test discussed by Chagas et al. [8] (Section 5):

$$p\text{-value} = 2\left[1 - \Phi\left(\frac{\bigl|H(\hat{\boldsymbol{p}}_1) - H(\hat{\boldsymbol{p}}_2)\bigr|}{\sqrt{\sigma^2_{n_1, \hat{\boldsymbol{p}}_1} + \sigma^2_{n_2, \hat{\boldsymbol{p}}_2}}}\right)\right],$$

where Φ is the cumulative distribution function of a standard normal random variable, H is any of the considered entropies computed with the observed proportions $\hat{\boldsymbol{p}}_i$, i = 1, 2, and $\sigma^2_{n_i, \hat{\boldsymbol{p}}_i}$ is the corresponding sample asymptotic variance that takes into account the sample size $n_i$. Notice that the test based on entropies compares only these features, and not the underlying distribution. The results in Table 5 are consistent with those provided by the χ² tests, i.e., the most significant differences arise between 1998 and 2008 and between 1998 and 2018. Moreover, the tests based on entropies do not reject the null hypothesis in the pair 2008-2018, except for the Rényi entropy of order 2/3. The increased p-values are a consequence of the information reduction: whereas the χ² test compares count-by-count, those based on entropies compare two scalars.
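A two-sided test of this kind can be sketched as follows for the normalized Shannon entropy. The p-value is computed as twice the upper tail of the standard normal distribution evaluated at the absolute difference of the entropies divided by the square root of the sum of their asymptotic variances; this form, the delta-method variance, and the counts below are assumptions of the sketch. The counts are invented for illustration and are not the survey data.

```python
import math
from statistics import NormalDist

def normalized_shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0) / math.log(len(p))

def normalized_shannon_variance(p, n):
    """First-order (delta method) variance; assumes all proportions are positive."""
    m1 = sum(x * math.log(x) for x in p)
    m2 = sum(x * math.log(x) ** 2 for x in p)
    return (m2 - m1 ** 2) / (n * math.log(len(p)) ** 2)

def entropy_test_p_value(h1, var1, h2, var2):
    """Two-sided p-value for the null hypothesis of equal entropies."""
    z = abs(h1 - h2) / math.sqrt(var1 + var2)
    return 2.0 * (1.0 - NormalDist().cdf(z))

counts_a = [120, 340, 260, 200, 80]   # hypothetical counts, five categories
counts_b = [450, 300, 250]            # hypothetical counts, three categories
n_a, n_b = sum(counts_a), sum(counts_b)
p_a = [c / n_a for c in counts_a]
p_b = [c / n_b for c in counts_b]
p_value = entropy_test_p_value(
    normalized_shannon(p_a), normalized_shannon_variance(p_a, n_a),
    normalized_shannon(p_b), normalized_shannon_variance(p_b, n_b),
)
print(p_value)
```

Because only two scalars and their variances enter the statistic, the two samples may have different numbers of categories, and the two entropies may even be of different types provided each comes with its own asymptotic variance.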
In the second part of this application, we will illustrate the use of test statistics based on entropies for comparing samples with different categories. Situations like this may appear when applying alternative versions of the same questionnaire in a series of surveys.
We collapsed the categories of 1998 into three: "agreement" (by adding "strongly agree" and "agree"), "indifference" ("not agree/disagree"), and "disagreement" (by adding "disagree" and "strongly disagree"). The resulting asymptotic mean entropies and asymptotic variances are shown in Table 6. Table 7 presents the p-values of the tests that verify the null hypothesis of the same entropy between the collapsed 1998 data (three categories), and 2008 and 2018 (five categories). These results agree with those presented in Table 5. Such an agreement suggests that, although the number of categories was reduced in 1998 from five to three, the tests based on entropies cope with the loss of information.

Conclusions
We presented expressions for the asymptotic distributions of the Rényi and Tsallis entropies of order q, and of the Fisher information. The Fisher information and the Tsallis and Shannon entropies have normal limit distributions with means and variances that depend on the underlying probabilities of the patterns and on the number of patterns. The Rényi entropy follows, asymptotically, a different distribution, cf. (35), but a Gaussian law can approximate it well. Those expressions pose no numerical challenges other than setting $0 \log 0 \doteq 0$. We verified that these asymptotic distributions are good models for data arising both from simulations with a variety of models and from the analysis of actual data.
On the one hand, the Fisher information is the one that fails more frequently to pass the Anderson-Darling normality tests. On the other hand, it does not provide evidence to reject the same hypothesis under the One-Almost-Zero model.
The distributions we present here can be used for building test statistics, as discussed by Chagas et al. [8]. Moreover, Equation (37) allows performing tests with mixed types of distributions, a situation that may appear in Internet of Things applications, in which, citing Borges et al. [17], one has to deal with "large time series data generated at different rates, of different types and magnitudes, possibly having issues concerning uncertainty, inconsistency, and incompleteness due to missing readings and sensor failures."

Pesquisa do Estado de São Paulo (FAPESP), and project APQ-00426-22 from Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG).

Institutional Review Board Statement: Not applicable.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflicts of interest.

Notation
The following notation is used in this manuscript:

Analogous to the computation in Equation (A1), it can be seen that

Symbolic data analysis [18] encompasses methods that study the statistical properties of data aggregated by criteria that meet some scientific question. Such methods have attracted lots of attention because they present competitive results in many data analysis applications [17,19,20].
Ordinal patterns [21] belong to this class of techniques. They impose low computational complexity and are inherently robust. This approach consists of constructing a set of symbolic ordinal patterns based on intrinsic data characteristics without any prior model. Ordinal patterns often reveal and quantify the underlying time series dynamics. In spite of their successful application to biomedicine, economics, mechanics and electronics engineering, image analysis and remote sensing, to name a few (see, for instance, Refs. [20,22,23]), little is known about the statistical properties of the features they induce. One of these features is entropy, in its several forms.
Signal analysis with ordinal patterns requires coding D observations into k = D! categories, in which D is typically small [8,20,24]. Motivated by these applications, we chose k ∈ {6, 24, 120, 720}, which allows checking results in various categories. Bear in mind that, when using ordinal patterns, the subsequent patterns are not independent and, thus, the multinomial distribution is an approximation.
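The coding of D consecutive observations into one of k = D! categories can be sketched as follows. This is a minimal illustration that uses overlapping windows and resolves ties by time order; both choices, the function name, and the example series are ours, not prescriptions from the cited works.

```python
from collections import Counter
from itertools import permutations

def ordinal_pattern_proportions(series, d):
    """Proportions of the d! ordinal patterns over overlapping windows of length d."""
    patterns = list(permutations(range(d)))   # the k = d! possible categories
    counts = Counter()
    for t in range(len(series) - d + 1):
        window = series[t:t + d]
        # Indices that sort the window; ties keep their time order.
        counts[tuple(sorted(range(d), key=window.__getitem__))] += 1
    total = sum(counts.values())              # requires len(series) >= d
    return [counts[pat] / total for pat in patterns]

series = [4, 7, 9, 10, 6, 11, 3]              # hypothetical time series
proportions = ordinal_pattern_proportions(series, d=3)
print(proportions)
```

The resulting vector of proportions plays the role of $\hat{\boldsymbol{p}}$ in the entropies studied here, with the caveat stated above that overlapping patterns are not independent, so the multinomial model is only an approximation.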